fix: improve ollama workflow from CI #53
Conversation
It'd be great to run this in CI; however, the smaller model sizes end up very flaky at consistently using tools, while the larger model sizes make this take a long time. I think we can revisit this when we get a more consistent small tool-use model.
I'm not sure the flake is something we want to skip, though. Both failures weren't due to the LLM; something was causing it to look for the OpenAI model rather than the model that's passed. This seems like something to investigate, right?
Yup, you're totally right - that's where I went first too. The changes to test_integration.py here fix this problem, which was showing up in some PRs. Now that it's fixed, though, I'm finding locally when running small models that the test_tools function fails pretty often.
We could consider trying to pass through a seed and temperature to get more consistency here though!
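For illustration, here's a minimal sketch of what passing those through could look like, assuming the test hits a local Ollama server directly. The endpoint, the `options` field, and the `seed`/`temperature`/`tools` names are standard Ollama request fields, but the helper itself is hypothetical rather than this repo's code:

```python
# Hypothetical helper (not the repo's API): pin seed and temperature via
# Ollama's per-request "options" so repeated CI runs are more reproducible.
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint

def deterministic_chat(model, messages, tools=None, seed=42, temperature=0.0):
    payload = {
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {"seed": seed, "temperature": temperature},
    }
    if tools:
        payload["tools"] = tools  # Ollama's tool-calling field
    resp = requests.post(OLLAMA_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["message"]  # assistant message, may contain "tool_calls"
```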
So do we feel deleting the tests is better than using a larger model, like qwen2.5, or just not overriding it and using mistral-nemo? Larger ones take 4-5m total, but OTOH it is more telling about the integration story. The reason I ask is that all models can hallucinate or give an imprecise response; the likelihood is just lower with larger ones. I had chatted with @michaelneale about whether it isn't better to use a higher-level retry concept than HTTP 5xx, etc., and then handle the unexpected results. What I mean is that deleting the test doesn't change this, and we could at least learn whether the default model is routinely problematic.
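For illustration, a sketch of what that higher-level retry could look like: instead of retrying only on transport failures such as HTTP 5xx, re-ask the model whenever the reply comes back without a tool call. The function is hypothetical, not the repo's API:

```python
# Hypothetical sketch: retry on "the model didn't use a tool" rather than only
# on HTTP errors. `call` is any zero-arg function that returns an Ollama-style
# assistant message dict, which may contain a "tool_calls" list.
def retry_on_missing_tool_call(call, attempts=3):
    last = None
    for _ in range(attempts):
        last = call()
        if last.get("tool_calls"):
            return last
    return last  # the caller decides whether a tool-less reply fails the test

# Example usage, building on the deterministic_chat sketch above:
# message = retry_on_missing_tool_call(
#     lambda: deterministic_chat("mistral-nemo", messages, tools=tools)
# )
```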
I am OK with either temporarily disabling it or making it the larger one (if it is worth it). I guess @baxen may be interested: is it worth a 4m concurrent test run with a larger LLM to test things a bit more end to end?
The other food for thought is that if/when we have an integration test in goose, there's possibly less pressure here; it will still be that 5m of overhead to run it, though. Personally, I have projects whose integration tests take a lot longer than 5m, but OTOH that doesn't make this part acceptable.
We can also do integration tests with a GH secret and an API call, but if it can work in a similar time with a local model, that's "nice"?
Only from protected branches though, right (not forks, as that would let folks steal the secrets)? I guess since most changes are from the same team, testing public providers on those types of branches is better than not testing them. That said, I think we can explore that independently of local tests, which eventually need to be solved even if they aren't prioritized vs., I guess, OpenAI.
Yeah, ideally we wouldn't - GH redacts secrets etc., but there's no way to be 100% sure with forks (and we have to bless runs anyway).
is there a way to run this without it holding back the checks? |
👍 to removing this, as it seems more are interested in this, and it conserves time for other things that we need to look at. Plus it prevents some changes I have from being blocked on a decision here. We can revisit this later and meanwhile test ad hoc.
#54 swaps back to the default model, which we can use until/if we merge deleting the workflow.
(force-pushed from 593b9cf to 9ee368f)
@codefromthecrypt @michaelneale take a look at the updates? I think - let's confirm - that setting the seed like this might fix the issue while still letting us use the small and fast model.
I think the problem was more that the summarizer would default to gpt-4o-mini - was there also some additional flake? If so, yeah, this looks good as well.
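For illustration, a hypothetical sketch of the kind of fix being described: resolve the model once and refuse to fall back silently to a library default such as gpt-4o-mini. The env var and function names are illustrative, not the repo's actual code:

```python
import os

def resolve_model(explicit=None):
    """Return the model for every call in the test, including any summarization
    step, and fail loudly instead of silently using a default like gpt-4o-mini."""
    model = explicit or os.environ.get("OLLAMA_MODEL")  # assumed env var name
    if not model:
        raise RuntimeError("No model configured; set OLLAMA_MODEL or pass one explicitly")
    return model
```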
(force-pushed from 9ee368f to f5a693b)
yep I think good
coolio!
Co-authored-by: Adrian Cole <[email protected]>
Yes, this works well: #59. I tested it on top of @lukealvoeiro's changes and it seems good - I think we can move ahead with this.
Thanks @baxen!